How much homophony should we expect for a given wordform, as a function of its phonotactics and word length? Is this expected homophony more or less than the observed homophony for that wordform? And critically: does this Homophony Delta vary across wordforms as a function of Wordform Frequency?
That is, if there is a pressure against homophony in real lexica (Trott & Bergen, 2020), that pressure might manifest unevenly across the lexicon. Based on previous work, it is possible that the pressure against homophony is actually weakest for long wordforms, as well as phonotactically implausible wordforms. Why would this be the case? It seems counterintuitive to suppose that a lexicon designed for efficiency would not optimize the use of its wordforms by concentrating homophones among the shortest, most phonotactically plausible wordforms.
One explanation is that word length is correlated with frequency. A negative selection pressure might be strongest among wordforms that, if ambiguous, would require very frequent disambiguation. If this interpretation is correct, Homophony Delta should correlate negatively with Frequency.
Again, this prediction appears counterintuitive, given the extensive evidence that wordform frequency is positively correlated with ambiguity, whether measured in terms of Number of homophones or Number of senses (Zipf, 1945; Piantadosi et al., 2012). But this observation would ultimately still be consistent with a mechanism selecting against homophones in the most frequent wordforms. The point is not that frequent wordforms should have fewer homophones than infrequent wordforms in absolute terms; rather, frequent wordforms should have fewer homophones than one would expect on account of their phonotactics and length, while this gap, or Homophony Delta, should be smaller (or even inverted) for infrequent wordforms.
Given \(M\) meanings to express, and \(W\) wordforms with which to express them, each associated with some probability \(p_i\), how many meanings should each wordform \(w_i\) carry as a function of its normalized probability \(p_i\)?
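To make this concrete, here is a minimal base-R sketch with made-up numbers (three wordforms of the same length, \(M = 10\) meanings; none of these values come from the real lexica): if meanings are assigned to wordforms in proportion to their probabilities, the expected number of lexical entries for wordform \(w_i\) is \(p_i \cdot M\), and the expected number of homophones is that quantity minus one (the wordform's own entry).

```r
# Toy example (hypothetical numbers, not drawn from any real lexicon).
p <- c(0.5, 0.3, 0.2)        # normalized probabilities of three wordforms
M <- 10                      # meanings available at this word length
expected_entries <- p * M    # expected number of entries per wordform
k <- expected_entries - 1    # expected homophones (subtract the wordform itself)
# The expected entries sum back to the total number of meanings:
sum(expected_entries)
```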
We also define several helper functions to simplify the data processing pipeline.
First, we need a function that reads in an aggregated lexicon (i.e., one entry per wordform, collapsed across homophones); this function then recovers the correct number of meanings M per word length and merges that information back into the aggregated lexicon.
read_data = function(agg_path) {
  # Read in aggregated data
  df_agg = read_csv(agg_path)
  # Identify the number of *entries* (words) at each word length:
  # each aggregated row contributes its homophones plus itself.
  df_total_counts = df_agg %>%
    mutate(k_actual = num_homophones + 1) %>%
    group_by(num_sylls_est) %>%
    summarise(M = sum(k_actual))
  # Merge with counts
  df_agg = merge(df_agg, df_total_counts, by = "num_sylls_est")
  df_agg
}
We then define a function to normalize the probability of a given wordform relative to the sum of probabilities of wordforms for that length.
normalize_probabilities = function(df) {
  ### For a given dataframe, normalize probabilities "p" relative to the
  ### total probability of wordforms of the same length.
  ## First get probabilities
  df = df %>%
    mutate(normalized_surprisal = heldout_surprisal / num_phones,
           p = 10 ^ heldout_log_prob)
  # Calculate sum(p) for each syllable length bucket.
  df_syl_sums = df %>%
    group_by(num_sylls_est) %>%
    summarise(total_prob = sum(p))
  # Merge with aggregated data, and normalize
  df = df %>%
    left_join(df_syl_sums, by = "num_sylls_est") %>%
    mutate(p_normalized = p / total_prob)
  df
}
We now define a function to compute the expected number of homophones \(k_i\) for a given wordform \(w_i\); it also calculates the delta between the real and expected number.
compute_expected_homophones = function(df) {
  ## Compute expected homophones "k" of a wordform by multiplying its normalized
  ## probability "p_normalized" by the number of meanings "M".
  df = df %>%
    # Expected number of "successes" (entries), minus 1 for the wordform itself
    mutate(k = p_normalized * M - 1) %>%
    # Delta between the real and expected number of homophones
    mutate(delta = num_homophones - k)
  df
}
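The two helpers can be sanity-checked on a toy lexicon (hypothetical values): after normalization, the expected entry counts \(k + 1\) should sum back to \(M\) within each syllable-length bucket. A base-R sketch of the same arithmetic:

```r
# Toy lexicon: two length buckets, made-up probabilities.
toy <- data.frame(
  num_sylls_est = c(1, 1, 2, 2, 2),
  p             = c(0.04, 0.01, 0.02, 0.02, 0.01),
  M             = c(2, 2, 3, 3, 3)   # entries per length bucket
)
# Normalize p within each bucket (mirrors normalize_probabilities).
toy$p_normalized <- toy$p / ave(toy$p, toy$num_sylls_est, FUN = sum)
# Expected homophones (mirrors compute_expected_homophones).
toy$k <- toy$p_normalized * toy$M - 1
# Within each bucket, sum(k + 1) recovers M:
tapply(toy$k + 1, toy$num_sylls_est, sum)
```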
We also define a function to run the main regression:
run_regression = function(df) {
  # Initialize output
  out = list()
  # Build full model
  model_full = lm(data = df,
                  delta ~ frequency + num_sylls_est + normalized_surprisal)
  # Build reduced model (without frequency)
  model_reduced = lm(data = df,
                     delta ~ num_sylls_est + normalized_surprisal)
  # Run model comparison
  comparison = anova(model_reduced, model_full)
  df_comp = broom::tidy(comparison) %>%
    na.omit()
  # Tidy model output
  df_model = broom::tidy(model_full)
  # Add results to output list
  out$model = df_model
  out$comparison = df_comp
  out
}
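To illustrate what the nested comparison buys us, here is a simulated example (entirely made-up data, not the real lexica): when frequency genuinely contributes to delta beyond length and surprisal, `anova()` on the reduced vs. full model yields a significant F-test.

```r
# Simulate data with a true negative effect of frequency on delta.
set.seed(1)
n <- 500
frequency            <- rnorm(n)
num_sylls_est        <- sample(1:4, n, replace = TRUE)
normalized_surprisal <- rnorm(n)
delta <- -0.7 * frequency + 0.5 * num_sylls_est + rnorm(n)
full    <- lm(delta ~ frequency + num_sylls_est + normalized_surprisal)
reduced <- lm(delta ~ num_sylls_est + normalized_surprisal)
comparison <- anova(reduced, full)
comparison$`Pr(>F)`[2]   # very small: frequency improves model fit
```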
And several plotting functions:
plot_outcome = function(df_output) {
  ### Plots the outcome of the regression
  df_output$model$predictor = fct_recode(
    df_output$model$term,
    "Number of Syllables" = "num_sylls_est",
    "Normalized Surprisal" = "normalized_surprisal",
    "Log(Frequency)" = "frequency"
  )
  g = df_output$model %>%
    ggplot(aes(x = predictor,
               y = estimate)) +
    geom_point() +
    coord_flip() +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = estimate - 2 * std.error,
                      ymax = estimate + 2 * std.error),
                  width = .2,
                  position = position_dodge(.9)) +
    labs(x = "Predictor",
         y = "Estimate") +
    theme_minimal()
  g
}
plot_comparison = function(df) {
  # Plots expected vs. actual number of homophones for each wordform.
  g = df %>%
    ggplot(aes(x = k,
               y = num_homophones)) +
    geom_point(alpha = .5) +
    scale_x_continuous(limits = c(-1, max(df$k))) +
    scale_y_continuous(limits = c(0, max(df$k))) +
    geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
    labs(x = "Expected number of homophones",
         y = "Actual number of homophones") +
    theme_minimal()
  g
}
plot_bins = function(df, n, column, label) {
  # Plots mean delta across n bins of the given column.
  df$binned = as.numeric(ntile(pull(df, column), n = n))
  g = df %>%
    group_by(binned) %>%
    summarise(mean_delta = mean(delta),
              se_delta = sd(delta) / sqrt(n())) %>%
    ggplot(aes(x = binned,
               y = mean_delta)) +
    geom_point(size = .2) +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = mean_delta - se_delta,
                      ymax = mean_delta + se_delta),
                  width = .2,
                  position = position_dodge(.9)) +
    labs(x = label,
         y = "Delta (Real - Expected)") +
    geom_smooth() +
    theme_minimal()
  g
}
plot_contrast = function(df, bins = 20) {
  # Plots expected vs. actual homophones across frequency bins.
  df$frequency_binned = as.numeric(ntile(df$frequency, n = bins))
  g = df %>%
    mutate(expected = k,
           actual = num_homophones) %>%
    pivot_longer(c(expected, actual), names_to = "model", values_to = "homophones") %>%
    ggplot(aes(x = frequency_binned,
               y = homophones,
               color = model)) +
    stat_summary(fun = mean,
                 fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)),
                 fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)),
                 geom = "pointrange") +
    labs(x = "Binned Frequency",
         y = "Number of Homophones") +
    theme_bw()
  g
}
First, we load and process the English data.
df_english = read_data("../data/processed/english/reals/english_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## index = col_double(),
## Word = col_character(),
## CobLog = col_double(),
## CompCnt = col_double(),
## PhonDISC = col_character(),
## Class = col_character(),
## SylCnt = col_double(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double(),
## rank_num_homophones = col_double(),
## neighborhood_size = col_double(),
## heldout_log_prob = col_double(),
## heldout_surprisal = col_double()
## )
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_english)
## [1] 35107
## Now normalize probabilities and compute expected homophones per wordform
df_english = df_english %>%
normalize_probabilities() %>%
compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_english %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 7706. 7706
## 2 2 15247. 15247
## 3 3 11379 11379
## 4 4 5316 5316
## 5 5 1694 1694
## 6 6 439 439
## 7 7 95 95
## 8 8 10 10
## 9 10 1 1
We then merge with frequency data.
df_subtlex = read_csv("../data/frequency/english/SUBTLEXusfrequencyabove1.csv")
## Parsed with column specification:
## cols(
## Word = col_character(),
## FREQcount = col_double(),
## CDcount = col_double(),
## FREQlow = col_double(),
## Cdlow = col_double(),
## SUBTLWF = col_double(),
## Lg10WF = col_double(),
## SUBTLCD = col_double(),
## Lg10CD = col_double()
## )
nrow(df_subtlex)
## [1] 60384
df_subtlex$Word = tolower(df_subtlex$Word)
df_merged_english = df_english %>%
  inner_join(df_subtlex, by = "Word")
## Joining, by = "Word"
nrow(df_merged_english)
## [1] 24307
df_merged_english$frequency = df_merged_english$Lg10WF
Now we can visualize the outcome in a variety of ways:
df_merged_english %>%
plot_comparison()
df_merged_english %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_english %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.59 0.425 -15.5 4.55e- 54
## 2 frequency -0.655 0.0837 -7.82 5.29e- 15
## 3 num_sylls_est 0.674 0.0791 8.52 1.71e- 17
## 4 normalized_surprisal 3.92 0.154 25.4 1.95e-140
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24303 2731590. 1 6882. 61.2 5.29e-15
Directly contrast the relationships:
df_merged_english %>%
plot_contrast(bins = 20)
df_english_lemmas = read_delim("../data/frequency/english/lemmas.csv", delim = "\\",
quote = "")
## Parsed with column specification:
## cols(
## PhonDISC = col_character(),
## Cob = col_double()
## )
nrow(df_english_lemmas)
## [1] 52447
df_english_lemmas = df_english_lemmas %>%
mutate(freq_adjusted = Cob + 1) %>%
group_by(PhonDISC) %>%
mutate(total_frequency = sum(freq_adjusted),
relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_english_lemmas %>%
group_by(PhonDISC) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)),
total_frequency = mean(total_frequency))
## `summarise()` ungrouping output (override with `.groups` argument)
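The entropy computed above is the Shannon entropy (in nats) of the distribution over lemmas sharing a wordform. A toy check with made-up frequencies (two hypothetical lemmas sharing one wordform):

```r
# Two lemmas sharing one wordform, with hypothetical adjusted frequencies.
freq <- c(30, 10)
rel  <- freq / sum(freq)          # relative frequencies: 0.75, 0.25
entropy <- -sum(rel * log(rel))   # Shannon entropy in nats
entropy                           # below log(2), the uniform-split maximum
```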
df_merged_english = df_merged_english %>%
  inner_join(df_entropy, by = "PhonDISC")
nrow(filter(df_merged_english, num_homophones == 1))
## [1] 3550
model_full = lm(data = filter(df_merged_english, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_merged_english, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -245.772 -0.664 1.137 2.715 7.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.8108 1.0916 -7.155 1.01e-12 ***
## entropy 1.1216 0.7343 1.527 0.126760
## frequency -1.1569 0.1831 -6.318 2.99e-10 ***
## num_sylls_est 0.7758 0.2330 3.330 0.000878 ***
## normalized_surprisal 5.1642 0.3501 14.750 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.384 on 3545 degrees of freedom
## Multiple R-squared: 0.07563, Adjusted R-squared: 0.07458
## F-statistic: 72.51 on 4 and 3545 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_english, num_homophones == 1),
                   delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3546 312343
## 2 3545 312138 1 205.41 2.3328 0.1268
First, we load and process the Dutch data.
df_dutch = read_data("../data/processed/dutch/reals/dutch_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## PhonStrsDISC = col_character(),
## PhonCVBr = col_character(),
## PhonSylBCLX = col_character(),
## PhonStrsStDISC = col_character(),
## PhonStCVBr = col_character(),
## PhonSylStBCLX = col_character(),
## PhonolCLX = col_character(),
## PhonolCPA = col_character(),
## PhonDISC = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_dutch)
## [1] 65260
## Now normalize probabilities and compute expected homophones per wordform
df_dutch = df_dutch %>%
normalize_probabilities() %>%
compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_dutch %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 4672 4672
## 2 2 20588 20588
## 3 3 26420 26420
## 4 4 11338 11338
## 5 5 3394 3394
## 6 6 846. 846
## 7 7 189. 189
## 8 8 28 28
## 9 9 2 2
We then merge with frequency data.
df_subtlex = read_delim("../data/frequency/dutch/SUBTLEX-NL.txt", delim="\t")
## Parsed with column specification:
## cols(
## Word = col_character(),
## FREQcount = col_double(),
## CDcount = col_double(),
## FREQlow = col_double(),
## CDlow = col_double(),
## FREQlemma = col_double(),
## SUBTLEXWF = col_double(),
## Lg10WF = col_double(),
## SUBTLEXCD = col_double(),
## Lg10CD = col_double()
## )
nrow(df_subtlex)
## [1] 437503
df_subtlex$Word = tolower(df_subtlex$Word)
df_merged_dutch = df_dutch %>%
  inner_join(df_subtlex, by = "Word")
## Joining, by = "Word"
nrow(df_merged_dutch)
## [1] 28330
df_merged_dutch$frequency = df_merged_dutch$Lg10WF
Now we can visualize the outcome in a variety of ways:
df_merged_dutch %>%
plot_comparison()
df_merged_dutch %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_dutch %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.64 0.834 -7.96 1.73e-15
## 2 frequency -3.38 0.170 -19.8 4.90e-87
## 3 num_sylls_est 0.289 0.164 1.77 7.75e- 2
## 4 normalized_surprisal 6.58 0.328 20.1 7.51e-89
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28326 15675501. 1 217909. 394. 4.90e-87
Directly contrast the relationships:
df_merged_dutch %>%
plot_contrast(bins = 20)
df_dutch_lemmas = read_delim("../data/frequency/dutch/lemmas.csv", delim = "\\")
## Parsed with column specification:
## cols(
## PhonDISC = col_character(),
## Inl = col_double()
## )
nrow(df_dutch_lemmas)
## [1] 123926
df_dutch_lemmas = df_dutch_lemmas %>%
mutate(freq_adjusted = Inl + 1) %>%
group_by(PhonDISC) %>%
mutate(total_frequency = sum(freq_adjusted),
relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_dutch_lemmas %>%
group_by(PhonDISC) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
## `summarise()` ungrouping output (override with `.groups` argument)
df_merged_dutch = df_merged_dutch %>%
inner_join(df_entropy, by = "PhonDISC")
nrow(df_merged_dutch)
## [1] 27615
nrow(filter(df_merged_dutch, num_homophones == 1))
## [1] 1258
model_full = lm(data = filter(df_merged_dutch, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_merged_dutch, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -560.62 -2.43 2.57 6.89 20.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.3964 3.7514 -1.705 0.0884 .
## entropy 0.6487 2.8147 0.230 0.8178
## frequency -4.8754 0.6750 -7.223 8.83e-13 ***
## num_sylls_est -1.1844 0.9042 -1.310 0.1905
## normalized_surprisal 9.7822 1.2843 7.617 5.11e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.6 on 1253 degrees of freedom
## Multiple R-squared: 0.0873, Adjusted R-squared: 0.08438
## F-statistic: 29.96 on 4 and 1253 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_dutch, num_homophones == 1),
                   delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1254 758586
## 2 1253 758554 1 32.161 0.0531 0.8178
df_german = read_data("../data/processed/german/reals/german_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## index = col_double(),
## Word = col_character(),
## CompCnt = col_double(),
## PhonDISC = col_character(),
## SylCnt = col_double(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double(),
## rank_num_homophones = col_double(),
## neighborhood_size = col_double(),
## heldout_log_prob = col_double(),
## heldout_surprisal = col_double()
## )
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_german)
## [1] 50435
## Now normalize probabilities and compute expected homophones per wordform
df_german = df_german %>%
normalize_probabilities() %>%
compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_german %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 2454 2454
## 2 2 13181 13181
## 3 3 19429 19429
## 4 4 10983 10983
## 5 5 3971 3971
## 6 6 1240 1240
## 7 7 328 328
## 8 8 72 72
## 9 9 16 16
## 10 10 2 2
We then merge with frequency data.
df_freq = read_csv("../data/frequency/german/SUBTLEX-DE.csv")
## Parsed with column specification:
## cols(
## Word = col_character(),
## `Spell check` = col_double(),
## FREQcount = col_double()
## )
nrow(df_freq)
## [1] 377524
# df_freq$Word = tolower(df_freq$Word)
# log10 frequency (+ 1). Should be analogous to the measures from English.
df_freq$Lg10WF = log10(df_freq$FREQcount + 1)
df_merged_german = df_german %>%
  inner_join(df_freq, by = "Word")
## Joining, by = "Word"
nrow(df_merged_german)
## [1] 24350
df_merged_german$frequency = df_merged_german$Lg10WF
Now we can visualize the outcome in a variety of ways:
df_merged_german %>%
plot_comparison()
df_merged_german %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_german %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.76 0.399 -12.0 7.99e- 33
## 2 frequency -1.67 0.0888 -18.8 3.12e- 78
## 3 num_sylls_est 0.316 0.0731 4.32 1.55e- 5
## 4 normalized_surprisal 4.19 0.157 26.7 2.36e-154
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24346 2750831. 1 39903. 353. 3.12e-78
Directly contrast the relationships:
df_merged_german %>%
plot_contrast(bins = 20)
df_german_lemmas = read_delim("../data/frequency/german/lemmas.csv", delim = "\\")
## Parsed with column specification:
## cols(
## PhonDISC = col_character(),
## Mann = col_double()
## )
nrow(df_german_lemmas)
## [1] 51728
df_german_lemmas = df_german_lemmas %>%
mutate(freq_adjusted = Mann + 1) %>%
group_by(PhonDISC) %>%
mutate(total_frequency = sum(freq_adjusted),
relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_german_lemmas %>%
group_by(PhonDISC) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
## `summarise()` ungrouping output (override with `.groups` argument)
df_merged_german = df_merged_german %>%
inner_join(df_entropy, by = "PhonDISC")
nrow(df_merged_german)
## [1] 23812
nrow(filter(df_merged_german, num_homophones == 1))
## [1] 740
model_full = lm(data = filter(df_merged_german, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_merged_german, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -257.105 -0.305 3.717 6.664 13.163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.8831 4.2261 -3.285 0.00107 **
## entropy 0.6983 3.2098 0.218 0.82785
## frequency -2.0243 0.6702 -3.021 0.00261 **
## num_sylls_est 0.7451 0.9702 0.768 0.44270
## normalized_surprisal 9.0383 1.3466 6.712 3.84e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.63 on 735 degrees of freedom
## Multiple R-squared: 0.07683, Adjusted R-squared: 0.07181
## F-statistic: 15.29 on 4 and 735 DF, p-value: 5.087e-12
model_reduced = lm(data = filter(df_merged_german, num_homophones == 1),
                   delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 736 283102
## 2 735 283084 1 18.226 0.0473 0.8278
df_french = read_data("../data/processed/french/reals/french_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `1_ortho` = col_character(),
## `2_phon` = col_character(),
## `3_lemme` = col_character(),
## `4_cgram` = col_character(),
## `5_genre` = col_character(),
## `6_nombre` = col_character(),
## `11_infover` = col_character(),
## `17_cvcv` = col_character(),
## `18_p_cvcv` = col_character(),
## `23_syll` = col_character(),
## `25_cv-cv` = col_character(),
## `26_orthrenv` = col_character(),
## `27_phonrenv` = col_character(),
## `28_orthosyll` = col_character(),
## `29_cgramortho` = col_character(),
## `34_morphoder` = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_french)
## [1] 37278
## Now normalize probabilities and compute expected homophones per wordform
df_french = df_french %>%
normalize_probabilities() %>%
compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_french %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 3893 3893
## 2 2 13794 13794
## 3 3 15258 15258
## 4 4 7589 7589
## 5 5 2466 2466
## 6 6 670 670
## 7 7 97 97
## 8 8 12 12
## 9 9 3 3
French already includes frequency data, so we simply compute log frequency from the existing column.
Now we can visualize the outcome in a variety of ways:
df_french$frequency = log10(df_french$`10_freqlivres` + 1)
df_french %>%
plot_comparison()
df_french %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_french %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.82 0.262 -22.2 3.01e-108
## 2 frequency -0.940 0.104 -9.00 2.41e- 19
## 3 num_sylls_est 0.580 0.0474 12.2 2.38e- 34
## 4 normalized_surprisal 3.19 0.109 29.3 5.81e-187
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 37274 2687005. 1 5836. 81.0 2.41e-19
Directly contrast the relationships:
df_french %>%
plot_contrast(bins = 20)
df_japanese = read_data("../data/processed/japanese/reals/japanese_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## orth_form_kanji = col_character(),
## orth_form_hiragana = col_character(),
## orth_form_romaji = col_character(),
## phonetic_form = col_character(),
## morph_form = col_character(),
## glosses = col_character(),
## multiple_pronunications = col_logical(),
## phonetic_remapped = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_japanese)
## [1] 40449
## Now normalize probabilities and compute expected homophones per wordform
df_japanese = df_japanese %>%
normalize_probabilities() %>%
compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_japanese %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 17 x 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 1011 1011
## 2 2 7710. 7710
## 3 3 14816 14816
## 4 4 14237 14237
## 5 5 6619 6619
## 6 6 3673 3673
## 7 7 1657 1657
## 8 8 733 733
## 9 9 345 345
## 10 10 185 185
## 11 11 72 72
## 12 12 52 52
## 13 13 19 19
## 14 14 8 8
## 15 15 7.00 7
## 16 16 2 2
## 17 22 1 1
Japanese already has frequency data, but we need to make sure it's summed across lemmas.
df_japanese_lemmas = read_csv("../data/processed/japanese/reals/japanese_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## orth_form_kanji = col_character(),
## orth_form_hiragana = col_character(),
## orth_form_romaji = col_character(),
## phonetic_form = col_character(),
## morph_form = col_character(),
## frequency = col_double(),
## glosses = col_character(),
## multiple_pronunications = col_logical(),
## phonetic_remapped = col_character(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double()
## )
nrow(df_japanese_lemmas)
## [1] 51147
df_japanese_lemmas = df_japanese_lemmas %>%
group_by(phonetic_remapped) %>%
summarise(total_frequency = sum(frequency))
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_japanese_lemmas)
## [1] 40449
df_japanese_merged = df_japanese %>%
inner_join(df_japanese_lemmas, by = "phonetic_remapped")
nrow(df_japanese)
## [1] 40449
nrow(df_japanese_merged)
## [1] 40449
df_japanese_merged$frequency = log10(df_japanese_merged$total_frequency + 1)
Now, visualize the relationship.
df_japanese_merged %>%
plot_comparison()
df_japanese_merged %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_japanese_merged %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.79 0.144 -26.3 3.12e-151
## 2 frequency -1.68 0.0896 -18.8 3.48e- 78
## 3 num_sylls_est 0.116 0.0183 6.35 2.19e- 10
## 4 normalized_surprisal 2.95 0.0851 34.7 1.48e-259
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40445 1203990. 1 10476. 352. 3.48e-78
Directly contrast the relationships:
df_japanese_merged %>%
plot_contrast(bins = 20)
df_japanese_lemmas = read_csv("../data/processed/japanese/reals/japanese_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## orth_form_kanji = col_character(),
## orth_form_hiragana = col_character(),
## orth_form_romaji = col_character(),
## phonetic_form = col_character(),
## morph_form = col_character(),
## frequency = col_double(),
## glosses = col_character(),
## multiple_pronunications = col_logical(),
## phonetic_remapped = col_character(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double()
## )
nrow(df_japanese_lemmas)
## [1] 51147
df_japanese_lemmas = df_japanese_lemmas %>%
  mutate(freq_adjusted = frequency + 1) %>%
  group_by(phonetic_remapped) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_japanese_lemmas %>%
group_by(phonetic_remapped) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
## `summarise()` ungrouping output (override with `.groups` argument)
df_japanese_entropy = df_japanese_merged %>%
inner_join(df_entropy, by = "phonetic_remapped")
nrow(df_japanese_entropy)
## [1] 40449
nrow(filter(df_japanese_entropy, num_homophones == 1))
## [1] 4034
model_full = lm(data = filter(df_japanese_entropy, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_japanese_entropy, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -229.639 -0.132 1.238 2.042 5.530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.75904 1.18277 -4.869 1.16e-06 ***
## entropy 2.35183 1.48902 1.579 0.1143
## frequency -0.89102 0.37742 -2.361 0.0183 *
## num_sylls_est 0.07058 0.11106 0.636 0.5251
## normalized_surprisal 3.42456 0.32391 10.573 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.442 on 4029 degrees of freedom
## Multiple R-squared: 0.03562, Adjusted R-squared: 0.03466
## F-statistic: 37.2 on 4 and 4029 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_japanese_entropy, num_homophones == 1),
delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4030 167317
## 2 4029 167214 1 103.53 2.4946 0.1143
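As a sanity check, the F statistic reported by `anova()` can be recovered by hand from the residual sums of squares of the nested models. A small Python sketch using the values printed in the table above:

```python
def nested_f(sum_sq, df_diff, rss_full, df_full):
    """F statistic for comparing nested OLS models:
    (Sum of Sq / df_diff) / (RSS_full / df_full)."""
    return (sum_sq / df_diff) / (rss_full / df_full)

# Values copied from the ANOVA table above
f_stat = nested_f(sum_sq=103.53, df_diff=1, rss_full=167214, df_full=4029)
```

`f_stat` agrees with the reported F of 2.4946 (p = 0.1143) up to rounding.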
df_mandarin = read_data("../data/processed/mandarin/reals/mandarin_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Key = col_character(),
## `PY+T` = col_character(),
## `Pho+T` = col_character(),
## `IPA+T` = col_character(),
## `PY-T` = col_character(),
## `Pho-T` = col_character(),
## `IPA-T` = col_character(),
## Tone = col_character(),
## SyStruct = col_character(),
## Words = col_character(),
## `Dominate-POS` = col_character(),
## `Other-POSes` = col_character(),
## Neighbors = col_character(),
## file = col_character(),
## homophonous_words = col_character(),
## word = col_character(),
## glides_remapped = col_character(),
## phonetic_remapped = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_mandarin)
## [1] 45871
## Now normalize probabilities and compute expected homophones per wordform
df_mandarin = df_mandarin %>%
normalize_probabilities() %>%
compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
nrow(df_mandarin)
## [1] 45871
## Double-check that this matches the correct total in the lexicon
df_mandarin %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 4658 4658
## 2 2 32372 32372
## 3 3 9186 9186
## 4 4 5336 5336
## 5 5 312 312
## 6 6 97 97
## 7 7 56 56
## 8 8 17 17
## 9 9 3 3
## 10 10 6 6
## 11 11 3 3
## 12 14 1 1
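The check above relies on a simple identity: if k is the number of other meanings sharing a wordform, then summing k + 1 across distinct wordforms recovers the total number of meanings M, because each meaning maps to exactly one wordform. A toy Python illustration with hypothetical forms:

```python
from collections import Counter

# Toy lexicon: each of M meanings maps to exactly one (made-up) wordform.
meanings_to_forms = ["ka", "ka", "to", "mi", "to", "ka"]
M = len(meanings_to_forms)  # 6 meanings

counts = Counter(meanings_to_forms)  # meanings per distinct wordform
# k = number of *other* meanings sharing the form, so k + 1 is its count
total = sum((count - 1) + 1 for count in counts.values())
```

The same identity holds for the baseline lexica, where k is the expected (possibly fractional) number of homophones within each syllable-length class.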
Mandarin already has frequency data, but we need to make sure it's summed across lemmas.
df_mandarin_lemmas = read_csv("../data/processed/mandarin/reals/mandarin_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Key = col_character(),
## `PY+T` = col_character(),
## `Pho+T` = col_character(),
## `IPA+T` = col_character(),
## `PY-T` = col_character(),
## `Pho-T` = col_character(),
## `IPA-T` = col_character(),
## Tone = col_character(),
## SyStruct = col_character(),
## Words = col_character(),
## `Dominate-POS` = col_character(),
## `Other-POSes` = col_character(),
## Neighbors = col_character(),
## file = col_character(),
## homophonous_words = col_character(),
## word = col_character(),
## glides_remapped = col_character(),
## phonetic_remapped = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
df_mandarin_lemmas = df_mandarin_lemmas %>%
mutate(frequency = FreqPM * 1000000) %>%
group_by(phonetic_remapped) %>%
summarise(total_frequency = sum(frequency))
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_mandarin_lemmas)
## [1] 45871
df_mandarin_merged = df_mandarin %>%
inner_join(df_mandarin_lemmas, by = "phonetic_remapped")
nrow(df_mandarin)
## [1] 45871
nrow(df_mandarin_merged)
## [1] 45871
df_mandarin_merged$frequency = log10(df_mandarin_merged$total_frequency + 1)
Now we can visualize the outcome in a variety of ways:
df_mandarin_merged %>%
plot_comparison()
df_mandarin_merged %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_mandarin_merged %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.77 0.325 -11.6 4.16e-31
## 2 frequency -0.284 0.0548 -5.19 2.10e- 7
## 3 num_sylls_est 0.0996 0.0625 1.59 1.11e- 1
## 4 normalized_surprisal 4.37 0.256 17.1 2.67e-65
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 45867 4453802. 1 2617. 26.9 0.000000210
Directly contrast the relationships:
df_mandarin_merged %>%
plot_contrast(bins = 20)
Here, we calculate the Homophony Delta for Mandarin Chinese, using the Chinese Lexical Database.
df_mandarin_cld = read_data("../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## C1 = col_character(),
## C2 = col_character(),
## C3 = col_logical(),
## C4 = col_logical(),
## C1Structure = col_character(),
## C2Structure = col_character(),
## C3Structure = col_logical(),
## C4Structure = col_logical(),
## C1Type = col_character(),
## C2Type = col_character(),
## C3Type = col_logical(),
## C4Type = col_logical(),
## Pinyin = col_character(),
## C1Pinyin = col_character(),
## C2Pinyin = col_character(),
## C3Pinyin = col_logical(),
## C4Pinyin = col_logical(),
## IPA = col_character(),
## C1IPA = col_character()
## # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 712019 parsing failures.
## row col expected actual file
## 30786 C3 1/0/T/F/TRUE/FALSE 么 '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Structure 1/0/T/F/TRUE/FALSE SG '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Type 1/0/T/F/TRUE/FALSE Other '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Pinyin 1/0/T/F/TRUE/FALSE me5 '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3IPA 1/0/T/F/TRUE/FALSE mɤ5 '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## ..... ........... .................. ...... ...............................................................................
## See problems(...) for more details.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_mandarin_cld)
## [1] 41009
## Now normalize probabilities and compute expected homophones per wordform
df_mandarin_cld = df_mandarin_cld %>%
normalize_probabilities() %>%
compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
nrow(df_mandarin_cld)
## [1] 41009
## Double-check that this matches the correct total in the lexicon
df_mandarin_cld %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 3648 3648
## 2 2 31637. 31637
## 3 3 6930 6930
## 4 4 3337 3337
Mandarin already has frequency data, but we need to make sure it's summed across lemmas.
df_mandarin_lemmas = read_csv("../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## C1 = col_character(),
## C2 = col_character(),
## C3 = col_logical(),
## C4 = col_logical(),
## C1Structure = col_character(),
## C2Structure = col_character(),
## C3Structure = col_logical(),
## C4Structure = col_logical(),
## C1Type = col_character(),
## C2Type = col_character(),
## C3Type = col_logical(),
## C4Type = col_logical(),
## Pinyin = col_character(),
## C1Pinyin = col_character(),
## C2Pinyin = col_character(),
## C3Pinyin = col_logical(),
## C4Pinyin = col_logical(),
## IPA = col_character(),
## C1IPA = col_character()
## # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 715162 parsing failures.
## row col expected actual file
## 35286 C3 1/0/T/F/TRUE/FALSE 么 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Structure 1/0/T/F/TRUE/FALSE SG '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Type 1/0/T/F/TRUE/FALSE Other '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Pinyin 1/0/T/F/TRUE/FALSE me5 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3IPA 1/0/T/F/TRUE/FALSE mɤ5 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## ..... ........... .................. ...... ........................................................................
## See problems(...) for more details.
nrow(df_mandarin_lemmas)
## [1] 45552
df_mandarin_lemmas = df_mandarin_lemmas %>%
group_by(phonetic_remapped) %>%
summarise(total_frequency = sum(FrequencyRaw))
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_mandarin_lemmas)
## [1] 41009
df_mandarin_merged = df_mandarin_cld %>%
inner_join(df_mandarin_lemmas, by = "phonetic_remapped")
nrow(df_mandarin)
## [1] 45871
nrow(df_mandarin_merged)
## [1] 41009
df_mandarin_merged$frequency = log10(df_mandarin_merged$total_frequency)
Now we can visualize the outcome in a variety of ways:
df_mandarin_merged %>%
plot_comparison()
df_mandarin_merged %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_mandarin_merged %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.85 0.0987 -18.7 5.10e-78
## 2 frequency -0.275 0.0156 -17.6 8.29e-69
## 3 num_sylls_est 0.00764 0.0221 0.345 7.30e- 1
## 4 normalized_surprisal 2.71 0.0683 39.7 0.
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 41005 303218. 1 2281. 309. 8.29e-69
Directly contrast the relationships:
df_mandarin_merged %>%
plot_contrast(bins = 20)
df_mandarin_lemmas = read_csv("../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## C1 = col_character(),
## C2 = col_character(),
## C3 = col_logical(),
## C4 = col_logical(),
## C1Structure = col_character(),
## C2Structure = col_character(),
## C3Structure = col_logical(),
## C4Structure = col_logical(),
## C1Type = col_character(),
## C2Type = col_character(),
## C3Type = col_logical(),
## C4Type = col_logical(),
## Pinyin = col_character(),
## C1Pinyin = col_character(),
## C2Pinyin = col_character(),
## C3Pinyin = col_logical(),
## C4Pinyin = col_logical(),
## IPA = col_character(),
## C1IPA = col_character()
## # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 715162 parsing failures.
## row col expected actual file
## 35286 C3 1/0/T/F/TRUE/FALSE 么 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Structure 1/0/T/F/TRUE/FALSE SG '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Type 1/0/T/F/TRUE/FALSE Other '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Pinyin 1/0/T/F/TRUE/FALSE me5 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3IPA 1/0/T/F/TRUE/FALSE mɤ5 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## ..... ........... .................. ...... ........................................................................
## See problems(...) for more details.
nrow(df_mandarin_lemmas)
## [1] 45552
df_mandarin_lemmas = df_mandarin_lemmas %>%
mutate(freq_adjusted = FrequencyRaw + 1) %>%
# mutate(freq_adjusted = log10(frequency + 1) + .01) %>%
group_by(phonetic_remapped) %>%
mutate(total_frequency = sum(freq_adjusted),
relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_mandarin_lemmas %>%
group_by(phonetic_remapped) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
## `summarise()` ungrouping output (override with `.groups` argument)
df_mandarin_entropy = df_mandarin_merged %>%
inner_join(df_entropy, by = "phonetic_remapped")
nrow(df_mandarin_entropy)
## [1] 41009
nrow(filter(df_mandarin_entropy, num_homophones == 1))
## [1] 1871
model_full = lm(data = filter(df_mandarin_entropy, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_mandarin_entropy, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.710 -0.380 0.886 1.827 4.344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7619 0.9007 -3.067 0.00220 **
## entropy -1.3703 0.4268 -3.211 0.00135 **
## frequency -0.7130 0.1257 -5.670 1.65e-08 ***
## num_sylls_est 0.5328 0.2405 2.216 0.02683 *
## normalized_surprisal 4.4816 0.4227 10.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.655 on 1866 degrees of freedom
## Multiple R-squared: 0.07813, Adjusted R-squared: 0.07615
## F-statistic: 39.54 on 4 and 1866 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_mandarin_entropy, num_homophones == 1),
delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1867 25067
## 2 1866 24929 1 137.71 10.308 0.001347 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Below, we combine the lexica so that we can visualize the central relationships in the same plot.
df_merged_english_f = df_merged_english %>%
mutate(binned_frequency = ntile(frequency, n = 20)) %>%
mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
mutate(language = 'English') %>%
select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
num_homophones, k, frequency, binned_frequency, delta, language)
df_merged_dutch_f = df_merged_dutch %>%
mutate(binned_frequency = ntile(frequency, n = 20)) %>%
mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
mutate(language = 'Dutch') %>%
select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
num_homophones, k, frequency, binned_frequency, delta, language)
df_merged_german_f = df_merged_german %>%
mutate(binned_frequency = ntile(frequency, n = 20)) %>%
mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
mutate(language = 'German') %>%
select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
num_homophones, k, frequency, binned_frequency, delta, language)
df_french_f = df_french %>%
mutate(binned_frequency = ntile(frequency, n = 20)) %>%
mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
mutate(language = 'French') %>%
select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
num_homophones, k, frequency, binned_frequency, delta, language)
df_japanese_f = df_japanese_merged %>%
mutate(binned_frequency = ntile(frequency, n = 20)) %>%
mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
mutate(language = 'Japanese') %>%
select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
num_homophones, k, frequency, binned_frequency, delta, language)
df_mandarin_cld_f = df_mandarin_merged %>%
mutate(binned_frequency = ntile(frequency, n = 20)) %>%
mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
mutate(language = 'Mandarin') %>%
select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
num_homophones, k, frequency, binned_frequency, delta, language)
df_all_lexica = df_merged_english_f %>%
rbind(df_merged_dutch_f) %>%
rbind(df_merged_german_f) %>%
rbind(df_french_f) %>%
rbind(df_japanese_f) %>%
rbind(df_mandarin_cld_f)
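`ntile(x, n = 20)` assigns each value to one of 20 approximately equal-count rank bins. A rough Python analogue of dplyr's behavior (larger bins come first when the split is uneven; tie handling may differ from dplyr's):

```python
def ntile(values, n):
    """Assign each value a bin in 1..n by rank, keeping bin sizes
    as equal as possible (rough analogue of dplyr::ntile)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n // len(values) + 1
    return bins

ntile([0.1, 5.0, 2.2, 9.9], n=2)  # -> [1, 2, 1, 2]
```

Binning by frequency rank rather than raw frequency keeps each bin equally populated, which stabilizes the per-bin means and standard errors plotted below.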
PlotTheme = theme(
axis.title.x = element_text(size = 16),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.y = element_text(size = 16),
strip.text.x = element_text(size = 16),
title = element_text(size = 16),
legend.text = element_text(size = 16),
legend.title = element_text(size = 16))
df_all_lexica %>%
group_by(binned_frequency, language) %>%
summarise(mean_delta = mean(delta),
se_delta = sd(delta) / sqrt(n())) %>%
ggplot(aes(x = binned_frequency,
y = mean_delta)) +
geom_point(stat = "identity", size = .2) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_delta - 2 * se_delta,
ymax = mean_delta + 2 * se_delta),
width=.2,
position=position_dodge(.9)) +
labs(x = "Binned Frequency",
y = "Delta (Real - Expected)") +
geom_smooth() +
theme_bw() +
facet_wrap(~language) +
PlotTheme
## `summarise()` regrouping output by 'binned_frequency' (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggsave("../Figures/delta.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_all_lexica %>%
mutate(Baseline = k,
Real = num_homophones) %>%
pivot_longer(c(Baseline, Real),
names_to = "Lexicon",
values_to = "homophones") %>%
group_by(binned_frequency, language, Lexicon) %>%
summarise(mean_homophones = mean(homophones),
se_homophones = sd(homophones) / sqrt(n())) %>%
ggplot(aes(x = binned_frequency,
y = mean_homophones,
color = Lexicon)) +
geom_point(stat = "identity", size = .5, alpha = .5) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_homophones - 2 * se_homophones,
ymax = mean_homophones + 2 * se_homophones),
width=.2) +
labs(x = "Binned Frequency",
y = "Number of Homophones") +
geom_smooth(alpha = .6) +
theme_bw() +
facet_wrap(~language) +
PlotTheme +
theme(panel.spacing.x = unit(1.5, "lines"))
## `summarise()` regrouping output by 'binned_frequency', 'language' (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggsave("../Figures/direct_comparison_frequency.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_all_lexica %>%
mutate(Baseline = k,
Real = num_homophones) %>%
pivot_longer(c(Baseline, Real),
names_to = "Lexicon",
values_to = "homophones") %>%
group_by(binned_neighorhood_size, language, Lexicon) %>%
summarise(mean_homophones = mean(homophones),
se_homophones = sd(homophones) / sqrt(n())) %>%
ggplot(aes(x = binned_neighorhood_size,
y = mean_homophones,
color = Lexicon)) +
geom_point(stat = "identity", size = .5, alpha = .5) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_homophones - 2 * se_homophones,
ymax = mean_homophones + 2 * se_homophones),
width=.2) +
labs(x = "Binned Neighborhood Size",
y = "Number of Homophones") +
geom_smooth(alpha = .6) +
theme_bw() +
facet_wrap(~language) +
PlotTheme +
theme(panel.spacing.x = unit(1.5, "lines"))
## `summarise()` regrouping output by 'binned_neighorhood_size', 'language' (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggsave("../Figures/direct_comparison_neighbors.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
return_regression_table = function(df) {
# Build full model
model_full = lm(data = df,
delta ~ frequency + num_sylls_est + normalized_surprisal)
# Tidy model output
df_model = broom::tidy(model_full)
df_model
}
### Get coefficients for each language
english_coefficients = df_merged_english_f %>%
return_regression_table() %>%
mutate(language = 'English')
dutch_coefficients = df_merged_dutch_f %>%
return_regression_table() %>%
mutate(language = 'Dutch')
german_coefficients = df_merged_german_f %>%
return_regression_table() %>%
mutate(language = 'German')
french_coefficients = df_french_f %>%
return_regression_table() %>%
mutate(language = 'French')
japanese_coefficients = df_japanese_f %>%
return_regression_table() %>%
mutate(language = 'Japanese')
mandarin_coefficients = df_mandarin_cld_f %>%
return_regression_table() %>%
mutate(language = 'Mandarin')
# Combine into single dataframe
df_all_coefficients = english_coefficients %>%
rbind(dutch_coefficients) %>%
rbind(german_coefficients) %>%
rbind(french_coefficients) %>%
rbind(japanese_coefficients) %>%
rbind(mandarin_coefficients)
df_all_coefficients$predictor = fct_recode(
df_all_coefficients$term,
"Number of Syllables" = "num_sylls_est",
"Normalized Surprisal" = "normalized_surprisal",
"Log(Frequency)" = "frequency"
)
### Plots outcome of regression
df_all_coefficients %>%
filter(predictor != "(Intercept)") %>%
ggplot(aes(x = language,
y = estimate)) +
geom_point() +
coord_flip() +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = estimate - 2*std.error,
ymax = estimate + 2*std.error),
width=.2,
position=position_dodge(.9)) +
labs(x = "Predictor",
y = "Estimate") +
theme_bw() +
facet_wrap(~predictor) +
PlotTheme
### Save to figure
ggsave("../Figures/coefficients_faceted_factor.png", dpi = 300)
## Saving 7 x 5 in image
Across all languages tested, it appears that Frequency exhibits a negative relationship with Homophony Delta: more frequent wordforms are less homophonous than one would expect purely on the basis of their phonotactics.
But there are other predictions, again borne out by previous research (Trott & Bergen, 2020), about which areas of the lexicon may see the greatest increase or reduction in homophony. One example is Neighborhood Size. Wordforms with more neighbors might show a corresponding reduction in homophony for either of two reasons. These explanations are not mutually exclusive, though they posit distinct causal pathways for the reduction in homophony:
If larger neighborhoods already increase the possibility of perceptual confusion (e.g., "bat" vs. "pat"), a wordform \(w_i\) might have fewer homophones than its phonotactic probability \(p_i\) would predict when \(w_i\) already has many neighbors.
If languages select against over-saturation (Trott & Bergen, 2020), and if this selection process ultimately results in larger neighborhood sizes, one would expect the areas of the lexicon that have undergone the most negative selection relative to their phonotactics to have correspondingly larger neighborhoods.
run_regression_with_neighborhood_size = function(df) {
# initialize output
out = list()
# Log neighborhood
df$log_neighborhood = log10(df$neighborhood_size + 1)
# Build full model
model_full = lm(data = df,
delta ~ frequency + num_sylls_est + normalized_surprisal + log_neighborhood)
# Build reduced model
model_reduced = lm(data = df,
delta ~ num_sylls_est + normalized_surprisal + frequency)
# Run model comparison
comparison = anova(model_reduced, model_full)
df_comp = broom::tidy(comparison) %>%
na.omit()
# Tidy model output
df_model = broom::tidy(model_full)
# Add to list parameters
out$model = df_model
out$comparison = df_comp
out
}
plot_neighborhood_bins_residualized = function(df, n) {
### Plot residuals of delta ~ #sylls + surprisal + frequency against neighborhood size
# Build reduced model
model_reduced = lm(data = df,
delta ~ num_sylls_est + normalized_surprisal + frequency)
df$resid = residuals(model_reduced)
# Plots delta ~ frequency bins.
df$neighborhood_binned = as.numeric(ntile(df$neighborhood_size, n = n))
g = df %>%
group_by(neighborhood_binned) %>%
summarise(mean_delta = mean(resid),
se_delta = sd(resid) / sqrt(n())) %>%
ggplot(aes(x = neighborhood_binned,
y = mean_delta)) +
geom_point(stat = "identity", size = .2) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_delta - se_delta,
ymax = mean_delta + se_delta),
width=.2,
position=position_dodge(.9)) +
labs(x = "Binned Neighborhood Size",
y = "Residuals (Reduced Model)") +
geom_smooth() +
theme_minimal()
g
}
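The residualization above fits the reduced model and asks whether what remains of delta still varies with neighborhood size. A compact numpy sketch of that step, run on simulated data rather than the real lexica:

```python
import numpy as np

def residualize(y, X):
    """Residuals of y after OLS on X (an intercept column is added)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # stand-ins for the three covariates
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
resid = residualize(y, X)
# By construction, OLS residuals are mean-zero and orthogonal to each covariate;
# any remaining structure against neighborhood size is what the plot reveals.
```

Plotting these residuals against binned neighborhood size (as the function above does) therefore shows the neighborhood relationship over and above length, surprisal, and frequency.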
plot_outcome = function(df_output) {
df_output$model$predictor = fct_recode(
df_output$model$term,
"Number of Syllables" = "num_sylls_est",
"Normalized Surprisal" = "normalized_surprisal",
"Log(Frequency)" = "frequency",
"Log(Neighborhood Size)" = "log_neighborhood"
)
### Plots outcome of regression
g = df_output$model %>%
ggplot(aes(x = predictor,
y = estimate)) +
geom_point() +
coord_flip() +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = estimate - 2*std.error,
ymax = estimate + 2*std.error),
width=.2,
position=position_dodge(.9)) +
labs(x = "Predictor",
y = "Estimate") +
theme_minimal()
g
}
df_merged_english %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_merged_english %>%
plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_merged_english %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.28 0.529 -9.99 1.98e- 23
## 2 frequency -0.455 0.0938 -4.85 1.25e- 6
## 3 num_sylls_est 0.235 0.121 1.94 5.21e- 2
## 4 normalized_surprisal 3.91 0.171 22.9 1.57e-114
## 5 log_neighborhood -1.69 0.234 -7.21 5.67e- 13
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20844 2524551. 1 6301. 52.0 5.67e-13
df_merged_dutch %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_merged_dutch %>%
plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_merged_dutch %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.43 0.912 -7.05 1.82e-12
## 2 frequency -3.30 0.183 -18.0 3.74e-72
## 3 num_sylls_est 0.160 0.190 0.838 4.02e- 1
## 4 normalized_surprisal 6.80 0.338 20.1 2.57e-89
## 5 log_neighborhood -0.797 0.449 -1.78 7.58e- 2
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 27610 15647403. 1 1786. 3.15 0.0758
df_merged_german %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_merged_german %>%
plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_merged_german %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.85 0.439 -8.76 2.07e- 18
## 2 frequency -1.55 0.0928 -16.7 4.98e- 62
## 3 num_sylls_est 0.0842 0.0835 1.01 3.13e- 1
## 4 normalized_surprisal 4.19 0.162 25.9 2.06e-146
## 5 log_neighborhood -1.67 0.253 -6.61 3.96e- 11
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 23807 2739371. 1 5026. 43.7 3.96e-11
df_french %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_french %>%
plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_french %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.47 0.294 -8.41 4.12e- 17
## 2 frequency -0.505 0.105 -4.80 1.58e- 6
## 3 num_sylls_est -0.143 0.0555 -2.58 1.00e- 2
## 4 normalized_surprisal 3.06 0.108 28.3 2.65e-174
## 5 log_neighborhood -3.47 0.141 -24.5 5.49e-132
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 37273 2644265. 1 42740. 602. 5.49e-132
df_japanese_merged %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_japanese_merged %>%
plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_japanese_merged %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.280 0.177 -1.58 1.14e- 1
## 2 frequency -1.26 0.0893 -14.1 6.63e- 45
## 3 num_sylls_est -0.396 0.0238 -16.7 2.79e- 62
## 4 normalized_surprisal 2.70 0.0843 32.1 5.01e-223
## 5 log_neighborhood -2.46 0.0740 -33.2 1.07e-238
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40444 1172019. 1 31970. 1103. 1.07e-238
df_mandarin_merged %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_mandarin_merged %>%
plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_mandarin_merged %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.284 0.109 2.61 9.18e- 3
## 2 frequency -0.178 0.0155 -11.5 1.88e- 30
## 3 num_sylls_est -0.595 0.0259 -23.0 3.83e-116
## 4 normalized_surprisal 2.42 0.0673 35.9 1.17e-278
## 5 log_neighborhood -2.10 0.0496 -42.4 0.
output$comparison
## # A tibble: 1 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 41004 290500. 1 12718. 1795. 0